Final Data Science Project

Name : KIMAYA KHILARE

NUID : 002958773

IMAGE CLASSIFICATION AND NOISE DETECTION

Abstract :

In this project, I take up a couple of hundred images and work on them by reducing and detecting noise. Every device carrying a signal, whether analog or digital, is subject to noise, and noise comes in various types. The removal of noise from a signal is commonly referred to as noise reduction; these techniques apply to image and audio files. The process is carried out to enhance image quality: imperfect pixels, such as stuck pixels, need to be minimized to improve the signal, and this minimization yields an image that looks sharper to the human eye.

A classification algorithm is run on the images to find the distribution of images according to their class labels. This is a type of feature engineering in which images are sorted according to their labels; such a classification process was run for both datasets. I will further apply classification algorithms, namely Random Forest, Support Vector Machine, and Logistic Regression, and perform a comparative analysis of them.

For the noise reduction module, I use the Variational Autoencoder technique. A Variational Autoencoder is a neural network architecture.

Dataset : https://github.com/fastai/imagenette/tree/imagenette-noise

PART 1: DATA EXPLORING

IMPORTING LIBRARIES

Importing the dataset using ImageFolder and DataLoader. The ImageFolder class supports a powerful feature for composing batch datasets: in most cases, when building a batch dataset, arranging input data and its corresponding label in pairs has to be done manually, but with ImageFolder this becomes much easier when the dataset is composed of images. The DataLoader class of the torch.utils.data package is what actually returns each batch, given the transformations and data directory set with the Transform and ImageFolder classes above.

The dataset consists of 13,394 distinct images, divided into training and testing datasets at a ratio of roughly 70% to 30%. The training dataset therefore has 9,469 files and the testing dataset has 3,925 files.

Importing the testing data using ImageFolder and DataLoader

Function to get the shape (width, height) and label of each image
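Such a helper might look like the sketch below; the function name `image_info` is illustrative, and in the notebook the path/label pairs would come from the dataset's sample list rather than the demo file created here.

```python
import os, tempfile
from PIL import Image

def image_info(path, label):
    """Return (width, height, label) for one image file."""
    with Image.open(path) as img:
        width, height = img.size  # PIL reports size as (width, height)
    return width, height, label

# Demo on a synthetic file so the sketch is self-contained.
tmp = os.path.join(tempfile.mkdtemp(), "demo.png")
Image.new("RGB", (320, 240)).save(tmp)
print(image_info(tmp, 3))  # (320, 240, 3)
```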

Detecting Outliers

Here, I detect the outliers and plot the image size distribution of the training dataset

Plotting the Test Image Size distribution

It is quite evident from the above plot that image sizes vary considerably across the different class labels.

The driving factors of an image are its width and height; these two are usually the influencing characteristics of any image. A box plot was created on the basis of these characteristics, from which outliers can also be detected.

It is quite evident from the above boxplots that there are many outliers for each feature. For the training dataset, the outliers for width mostly lie in the range of 1,200 to 3,000 pixels, with one exception above 4,000 pixels; the outliers for height range between 1,000 and 3,700 pixels, with a couple of exceptions above 4,000 pixels. For the testing dataset, the outliers for width mostly lie in the range of 1,200 to 2,500 pixels, with a couple of exceptions above 3,000 pixels; the outliers for height range between 1,000 and 3,000 pixels, with a couple of exceptions above 3,000 pixels.
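The box-plot-based outlier check described above can be sketched as follows. The widths and heights here are randomly generated stand-ins for the sizes collected from the real images, and the 1.5×IQR rule is the standard criterion a box plot visualizes.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-ins for the (width, height) pairs of the training images.
rng = np.random.default_rng(0)
widths = rng.normal(500, 120, 500).clip(100)
heights = rng.normal(400, 100, 500).clip(100)

fig, ax = plt.subplots()
ax.boxplot([widths, heights])
ax.set_xticklabels(["width", "height"])
ax.set_ylabel("pixels")
fig.savefig("size_boxplot.png")

# The whiskers of a box plot mark the 1.5 * IQR fences; points beyond
# them are the outliers discussed above.
q1, q3 = np.percentile(widths, [25, 75])
iqr = q3 - q1
outliers = widths[(widths < q1 - 1.5 * iqr) | (widths > q3 + 1.5 * iqr)]
print(len(outliers), "width outliers")
```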

Normalization

Normalization is a technique for preparing datasets for artificial intelligence problems; it generally alters the range of the pixel intensity data. Here, the images have also undergone data normalization, implemented with PyTorch, which normalizes images on the fly. Normalization follows the torchvision recommendation, which is consistent across all the pretrained models, with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225].

Normalization is an essential step, as it ensures a uniform distribution of the pixel data, which leads to faster convergence later during network training; the result approximates a zero-centered Gaussian. Pixel values range from 0 to 255 and are recommended to be normalized to between 0 and 1, ensuring that all the data is unified into one range for easier computation.

Image Class Distribution

A classification program is run on the images to find the distribution of images according to their class labels. This is a type of feature engineering in which images are sorted according to their labels. Such a classification process was run for both datasets.

As you can observe in the train dataset, the class frequency is almost the same for all class labels; only the class with label 3 has a slightly lower frequency than the others.

From the above two figures, it is remarkably noticeable that there is no extreme difference between the class distributions. Both datasets have 10 class labels, ranging from 0 to 9.

Data Preprocessing Part

Add grayscaling so we have fewer features

Add resizing of the images to the pipeline

Now we need to flatten the images: each image of size 224x224 is converted into a flat vector with a large number of features.

Hereby, the flattened images are obtained through the above transforms.

Create test dataset with the same approach

Now we have train and test datasets. However, they have 50,176 features, which is too many for tree-based models, so we need to apply a dimensionality reduction technique.

PCA

There were 13,394 images in the whole dataset, divided into a training dataset of 9,469 images and a testing dataset of 3,925 images; each image has 50,176 features. This dimensionality, too high for tree-based models, was reduced with principal component analysis (PCA), a well-known unsupervised dimensionality reduction approach that projects data to a lower-dimensional space. We used PCA because, with 50,176 features, we were confronted with the curse of dimensionality; the PCA algorithm reduced these to 10 principal components. Beforehand, the images were grayscaled and flattened, which converts a multi-dimensional array into a one-dimensional one: after resizing each image to 224x224 in the pipeline, the flatten() function converts it into a flat vector with many features. Since the dataset contains a large number of images, flattening helps decrease memory usage as well as the time needed to train the models.

First we need to concatenate the train and test splits to perform PCA in one pass.

We will save the data as a CSV table.
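The concatenate → reduce → save steps can be sketched as below. The matrices are random stand-ins for the flattened image features (and much smaller than 9,469 + 3,925 rows so the sketch runs quickly), and the file name `pca_features.csv` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-ins for the flattened 50,176-feature train/test matrices.
rng = np.random.default_rng(0)
X_train = rng.random((90, 50176))
X_test = rng.random((40, 50176))

# Fit PCA once on the combined data so train and test share one projection.
X_all = np.concatenate([X_train, X_test])
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_all)

pd.DataFrame(X_reduced, columns=[f"pc{i}" for i in range(10)]).to_csv(
    "pca_features.csv", index=False)
print(X_reduced.shape)  # (130, 10)
```

Note that fitting PCA on the combined train and test data, as the notebook does, leaks test information into the projection; fitting on the training split alone and applying `transform` to the test split is the stricter alternative.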

Modelling and their Interpretations

Logistic Regression with hyperparameters penalty="l2", solver="lbfgs", and C=0.01

Below, I plot the confusion matrix for the dataset to assess the accuracy of the model.

Observing the heatmap, a substantial portion of the images are correctly classified, as seen on the diagonal. Class labels 1, 2, 6, and 9 show higher classification accuracy, whereas class labels 3 and 7 have very low classification accuracy.
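The model and confusion matrix above can be sketched as follows, using the stated hyperparameters on synthetic 10-class data (a stand-in for the 10-component PCA features); `max_iter` is raised only so the sketch converges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 10-class stand-in for the 10 PCA features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_classes=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(penalty="l2", solver="lbfgs", C=0.01, max_iter=1000)
clf.fit(X_tr, y_tr)

# Rows are true labels, columns predicted; the diagonal holds the
# correctly classified counts visible in the heatmap.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm.shape)  # (10, 10)
```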

RMSE

Mean squared error (MSE)

As we can infer, the AUC score is near 0.70, which is considered a good result.
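For reference, the metrics used here (MSE, RMSE, and a multi-class AUC) can be computed as in the sketch below; the tiny label arrays and probability table are illustrative, and multi-class AUC requires per-class probabilities with one column per class.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Illustrative true and predicted labels.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(mse, rmse)  # 0.2 0.447...

# Multi-class AUC is computed one-vs-rest from class probabilities
# (each row sums to 1, one column per class).
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.2, 0.7],
                  [0.1, 0.7, 0.2],
                  [0.7, 0.2, 0.1]])
auc = roc_auc_score(y_true, proba, multi_class="ovr")
```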

Random Forest

Performing GridSearch for Hyperparameters "n_estimators" and "max_features"

With max_features="sqrt", each split considers the square root of the total number of features as candidates; with "log2", it considers the base-2 logarithm of that number. Considering fewer candidate features per split generally reduces overfitting, and "sqrt" is the conventional default for classification problems. Accordingly, for this classification problem, max_features="sqrt" yielded better output than max_features="log2".
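The grid search over `n_estimators` and `max_features` can be sketched as below on synthetic stand-in data; the specific grid values are illustrative, not necessarily the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the PCA features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", "log2"],
}
# GridSearchCV fits every combination with 3-fold cross-validation
# and keeps the best-scoring one.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```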

RMSE

MSE

The AUC value we got was 0.74, meaning the model ranks a randomly chosen positive example above a randomly chosen negative one about 74% of the time.

SVM (Support Vector Machine)

Support vector machine (SVM) is a supervised machine learning model that can be used to solve classification and regression problems. Given sets of labeled training data for each category, an SVM model can categorize new examples. It transforms the data using a technique known as the kernel trick, and then calculates an ideal boundary between the available outputs based on these transformations. The hyperparameters used for this model are as follows:

Performing grid search with hyperparameters 'C', 'gamma', and 'kernel'
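A runnable sketch of this search is below, again on synthetic stand-in data; the grid values are illustrative, and `probability=True` is included only because probability estimates are needed for AUC scoring later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the PCA features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01],
    "kernel": ["rbf", "linear"],
}
search = GridSearchCV(SVC(probability=True), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```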

The AUC value we got was 0.55, only slightly better than the 0.50 expected from random ranking.

VAE Noising and Denoising

A Variational Autoencoder is a neural network architecture consisting of two parts: an encoder and a decoder. The encoder compresses the image into a vector of latent variables, while the decoder tries to reconstruct the original image given the latent vector. Data is fed to the variational autoencoder, and after compressing and decompressing, the output image is compared to the original non-noisy image. To train such a model we need a reconstruction loss function that calculates the difference between the reconstructed image and the non-noisy original. In this way the model learns to avoid encoding noise along with the rest of the image and to exclude it from the latent vector: every time the model is penalized for producing a noisy image, it learns more about how to discard this irrelevant information.
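A minimal VAE along these lines is sketched below. The fully-connected architecture, layer sizes, and latent dimension are illustrative stand-ins for the notebook's actual model; the key points it demonstrates are the reparameterization trick and a loss that compares the reconstruction to the *clean* image, which is what drives denoising.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingVAE(nn.Module):
    """Illustrative VAE: encoder -> (mu, logvar), sample z, decode."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, clean, mu, logvar):
    # The reconstruction is compared with the CLEAN image, so encoding
    # noise into the latent vector is penalized.
    rec = F.mse_loss(recon, clean, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = DenoisingVAE()
clean = torch.rand(8, 784)                                   # stand-in batch
noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0, 1)  # noisy input
recon, mu, logvar = model(noisy)
loss = vae_loss(recon, clean, mu, logvar)
print(recon.shape, loss.item() > 0)
```

In a training loop, `loss.backward()` and an optimizer step would follow; repeated over many batches, the penalty for noisy reconstructions pushes the latent code to drop the noise.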

For this part, I select a subset of the original dataset, i.e., Imagenette. The dataset has only 10 labels, which are defined in the next line.

I will first define a list of the labels.

Function to generate random noise and add the generated noise to the image
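A sketch of such a function is below; the Gaussian noise and the noise level `sigma` are assumptions about the notebook's exact noise model, and returning the noise alongside the noisy image makes the later "predict the noise, then subtract it" step easy to express.

```python
import torch

def add_noise(img, sigma=0.1):
    """Add Gaussian noise to an image tensor in [0, 1]; return both
    the clamped noisy image and the noise itself."""
    noise = sigma * torch.randn_like(img)
    return (img + noise).clamp(0.0, 1.0), noise

img = torch.rand(3, 64, 64)     # stand-in for one dataset image
noisy, noise = add_noise(img)
print(noisy.shape)  # torch.Size([3, 64, 64])
```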

Importing PyTorch. Preprocessing the dataset for training by applying transforms such as resizing and normalization.

Function to plot a batch of images and their corresponding labels.

I will plot a batch of images from the dataset with the help of the example_batch function.

I will now plot the same batch of images with noise added.

This is an original image

This is the image containing the noise which I have added.

The above image is the noise predicted by the model.

The above image is the reconstructed image after subtracting the noise from the noisy image.

Conclusion

I successfully performed image classification and noise reduction in this notebook. Various images from the Imagenette dataset were classified according to their class labels. I used three models: 1) Logistic Regression, 2) Random Forest, and 3) Support Vector Machine. These models were trained and tested on the dataset, and the metrics used to compare them were accuracy, log loss, RMSE, MSE, and AUC score. On analysis, the best results were obtained for Random Forest, with an accuracy of about 70%. In the second part I demonstrated image noising and denoising with a VAE (Variational Autoencoder) model. Random Gaussian noise was added to the images to produce noisy versions, this noise was predicted by the VAE model, and by subtracting the predicted noise from the noisy images we were able to reconstruct the original images.


License

Copyright (c) 2022 Kimaya Kishor Khilare

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.